The Real Problem
One of the key activities of any IT function is to “keep the lights on” and ensure there is no impact to business operations. IT uses the Incident Management process to achieve this objective. An incident is an unplanned interruption to an IT service, or a reduction in the quality of an IT service, that affects users and the business. The main goal of the Incident Management process is to provide a quick fix, workaround, or solution that resolves the interruption and restores the service to its full capacity, ensuring no business impact. In most organizations, incidents are created by business and IT users, by end users and vendors (if they have access to the ticketing systems), and by integrated monitoring systems and tools. Assigning incidents to the appropriate person or unit in the support team is critical for improving user satisfaction and for allocating support resources effectively. In many IT organizations, this assignment is still a manual process. Manual assignment is time consuming, requires human effort, and is prone to human error; misrouted tickets waste support resources. It also increases response and resolution times, which leads to poor customer service and deteriorating user satisfaction.
Business Domain Value
In the support process, incoming incidents are analyzed and assessed by the organization’s support teams. In many organizations, better allocation and effective usage of valuable support resources translates directly into substantial cost savings. Currently, incidents are created by various stakeholders (business users, IT users, and monitoring tools) within the IT Service Management tool and are assigned to the Service Desk teams (L1/L2). These teams review the incidents for correct categorization and priority, then carry out an initial diagnosis to see whether they can resolve them; around ~54% of incidents are resolved by the L1/L2 teams. If L1/L2 cannot resolve an incident, they escalate/assign it to the functional teams from Applications and Infrastructure (L3). Some incidents are assigned directly to L3 by monitoring tools or by the callers/requestors themselves. L3 teams carry out a detailed diagnosis and resolve around ~56% of incidents, reaching out for vendor support toward incident closure where needed. Before assigning tickets to functional teams, L1/L2 must review Standard Operating Procedures (SOPs): a minimum of ~25-30% of incidents need an SOP review before assignment, at about 15 minutes per incident, so a minimum of ~1 FTE of effort is consumed by incident assignment to L3 alone. During this manual routing there were multiple instances of incidents being assigned to the wrong functional group — around ~25% of incidents — requiring additional effort from functional teams to reassign them to the right group. Meanwhile, some incidents sit in a queue and are not addressed in time, resulting in poor customer service.
Powerful AI techniques that classify incidents to the right functional group can help organizations reduce resolution times and let support staff focus on more productive tasks.
In this capstone project, the goal is to build a classifier that assigns tickets to groups by analyzing their text. Details about the data and the dataset files are given in the link below: https://drive.google.com/open?id=1OZNJm81JXucV3HmZroMq6qCT2m7ez7IJ
!pip install googletrans
!pip install pyLDAvis
!pip install Unidecode
from google.colab import drive
drive.mount('/content/drive/')
#Set your project path
project_path = '/content/drive/My Drive/Colab Notebooks/Capstone Working copy/'
Excel_data_file = project_path + "input_data.xlsx"
print(Excel_data_file)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.feature_extraction.text import CountVectorizer
#from keras.utils.np_utils import to_categorical
%matplotlib inline
import re
from tensorflow.keras.preprocessing.text import Tokenizer as KerasTokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from dateutil import parser
from wordcloud import WordCloud, STOPWORDS
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models.ldamodel import LdaModel
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import googletrans
from googletrans import Translator
import warnings
from gensim.models import Word2Vec
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Flatten, Bidirectional, GlobalMaxPool1D,GRU,Conv1D,MaxPooling1D
from sklearn import metrics
from tensorflow.keras import backend as K
from tensorflow.keras.utils import plot_model
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
import random
import copy
import time
import gc
import torch
from torchtext import data
from tqdm import tqdm_notebook, tnrange
from tqdm.auto import tqdm
tqdm.pandas(desc='Progress')
from collections import Counter
from textblob import TextBlob
from nltk import word_tokenize
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from torch.autograd import Variable
from torchtext.data import Example
import torchtext
import os
# cross validation and metrics
from sklearn.model_selection import StratifiedKFold
from torch.optim.optimizer import Optimizer
from unidecode import unidecode
from sklearn.preprocessing import StandardScaler
from multiprocessing import Pool
from functools import partial
from sklearn.decomposition import PCA
import torch as t
from numpy.random import RandomState
import logging
#from fastai.text import *
print(tf.__version__)
warnings.filterwarnings("ignore")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # FATAL
logging.getLogger('tensorflow').setLevel(logging.FATAL)
from nltk.corpus import stopwords
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
stop = set(stopwords.words('english'))
from IPython.display import display
data = pd.read_excel(Excel_data_file)
data.head()
data.info()
Observations:
The given data has a total of 8500 entries and 4 columns:
Column 'Assignment group' is our target/dependent variable y, while the other 3 columns are the independent variables X. All the columns are of datatype object. Also, 8 records in 'Short description' and 1 record in 'Description' have null values. We will look at this in detail below.
data.shape
data.columns
data.isna().sum()
Observations:
The data has 8500 records and 4 columns. As seen above, there are null records in the data; we will remove the 9 records that have null values.
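As a minimal illustration of this null-handling step, here is the same `isna`/`dropna` pattern on a tiny hand-made frame (toy data, not the project dataset):

```python
import pandas as pd

# Toy frame mimicking the dataset's columns; None cells become NaN.
toy = pd.DataFrame({
    "Short description": ["login issue", None, "vpn down"],
    "Description": ["cannot login", "printer error", None],
    "Assignment group": ["GRP_0", "GRP_1", "GRP_2"],
})
null_cells = int(toy.isna().sum().sum())  # 2 null cells across the frame
clean = toy.dropna()                      # drops every row containing any null
print(null_cells, clean.shape)            # 2 (1, 3)
```

`dropna()` removes a whole row if any of its cells is null, which is why dropping 8 + 1 null cells removes 9 records from the project data.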
#Dropping the null values
data.dropna(inplace=True)
data.shape
Removed the 9 records with null values, since they do not add any value to model building and prediction.
data['Assignment group'].nunique()
data['Assignment group'].unique()
In total there are 74 unique assignment groups, from GRP_0 to GRP_73.
data.describe()
data['Assignment group'].value_counts()
group_count = data['Assignment group'].value_counts()
group_count.describe()
sns.set_style("dark")
descending_order = data['Assignment group'].value_counts().sort_values(ascending=False).index
plt.subplots(figsize=(22,5))
ax=sns.countplot(x='Assignment group', data=data,order=descending_order)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
plt.title("Count plot by Group")
plt.show()
As seen from the graph above, the data is highly skewed, with GRP_0 holding more than 45% of the records. Other groups with many records are GRP_8, GRP_24, GRP_12, and GRP_9. Many groups have only 1 record, which can bias the predictions.
data['Caller'].nunique()
data['Caller'].value_counts()
There are only 2948 callers in total for 8491 records, which means each caller may have raised one or more tickets.
caller_count = data['Caller'].value_counts()
caller_count.describe()
CallerGrp = data.groupby(['Caller'])
LeastDataCaller = []
for grp in data['Caller'].unique():
    if CallerGrp.get_group(grp).shape[0] < 2:
        LeastDataCaller.append(grp)
print('Number of Callers who made only 1 call: ', len(LeastDataCaller))
Observations:
From the above data, it is clearly visible that a large number of callers raised only a single ticket.
#Find duplicate records in the given data
duplicate_data = data[data.duplicated()]
print(len(duplicate_data))
duplicate_data.head()
It looks like there are 83 duplicate records, and these need to be cleaned.
#Removing those duplicates
data = data.drop_duplicates()
data.shape
All 83 duplicate records have been removed from the data. Now let's continue with translation.
df_before_translation = data.copy() ##Taking backup before translation
df_before_translation.tail()
Since the data contains a lot of German text as well as non-ASCII characters, we need to translate all the data to English. Hence we use the Google Translate API to translate the records that are not in English.
#Translate to English if the given sentence is not in English.
def Translate_to_English(x):
    translator = Translator()
    if translator.detect(x).lang != 'en':
        #print("Source: ", x)
        translatedText = translator.translate(x).text
        #print("Translated text in English: ", translatedText)
    else:
        translatedText = x
    return translatedText
#Translate the Description and Short description columns
for i in data.index:
    # .at avoids chained-indexing assignment (data['col'][i] = ...), which can silently fail on a copy
    data.at[i, 'Description'] = Translate_to_English(str(data.at[i, 'Description']))
    data.at[i, 'Short description'] = Translate_to_English(str(data.at[i, 'Short description']))
data.tail()
##Save a copy of the translated text as an Excel file in the drive
data.to_excel(project_path + 'Translated_Data.xlsx')
#Making a copy of the data and then cleaning the data
df_translated_before_cleaning = data.copy()
data.shape
#Having callers unique list separately to remove the caller names from description later.
callers = data['Caller'].unique()
## merging the Short Description and Description Columns
new_data= pd.DataFrame({"Description": data["Short description"] + " " + data["Description"], "AssignmentGroup": data["Assignment group"]}, columns=["Description","AssignmentGroup"])
new_data.head()
new_data.shape
new_data.isna().sum()
The description contains many unwanted tokens — email addresses, numbers, special characters, disclaimer messages, etc. We will clean these first.
#This function removes the disclaimer messages sent as part of callers' emails, which are not needed
def Remove_Disclaimer(text):
    text = str(text)
    strDisclaimerMsg1 = r'this communication is intended solely for the use of the addressee and may contain information that is sensitive, confidential or excluded from disclosure in accordance with applicable law. it is strictly forbidden to distribute, distribute or reproduce this communication by anyone other than the intended recipient. if you have received this message by mistake, please notify the sender and delete this message.'
    strDisclaimerMsg2 = r'select the following link to view the disclaimer in an alternate language.'
    #to remove the pattern '[ # + company / posts> ['
    strDisclaimerMsg3 = r'\[.*?\['
    strDisclaimerMsg4 = r'this message is intended for the exclusive use of the person to whom it is addressed and may contain privileged, confidential information that is exempt from disclosure in accordance with the provisions of current legislation. any dissemination, distribution or reproduction of this message by someone other than the intended recipient is strictly prohibited. if you receive this message in error, please notify the sender and delete this message.'
    text = re.sub(strDisclaimerMsg1, ' ', str(text))
    text = re.sub(strDisclaimerMsg2, ' ', str(text))
    text = re.sub(strDisclaimerMsg3, ' ', str(text))
    text = re.sub(strDisclaimerMsg4, ' ', str(text))
    text = text.strip()
    return text
#Replace known formats with proper strings for better prediction
def preprocess_replace(text1):
    # text1 is a pandas Series; Series.replace(..., regex=True) substitutes each pattern
    text1 = text1.replace(to_replace='[Hh][Oo][sS][Tt][nN][Aa][Mm][Ee]_[0-9]*', value='hostname ', regex=True)
    text1 = text1.replace(to_replace='ftp*.*', value='ftp location ', regex=True)
    text1 = text1.replace(to_replace='[a-z0-9_]*.xlsx', value='excel ', regex=True)
    text1 = text1.replace(to_replace='outside:[0-9./]*', value='outside ipaddress ', regex=True)
    text1 = text1.replace(to_replace='inside:[0-9./]*', value='inside ipaddress ', regex=True)
    text1 = text1.replace(to_replace='\\*hostname[_0-9]*', value='hostname ', regex=True)
    text1 = text1.replace(to_replace='lmsl[0-9]*', value='lmsl ', regex=True)
    text1 = text1.replace(to_replace='SID_[0-9][0-9]', value='sid ', regex=True)
    text1 = text1.replace(to_replace='[Tt]icket_no[0-9]*', value='ticket_no ', regex=True)
    text1 = text1.replace(to_replace='[jJ]ob_[0-9_a-z]*', value='job_id ', regex=True)
    text1 = text1.replace(to_replace='[0-9]+.[0-9]+.[0-9]+.[0-9]+[/0-9]*', value='ipaddress ', regex=True)
    return text1
The Series.replace() method requires a pandas Series/DataFrame to be passed, not a plain str; hence this call is made outside of the clean_data function.
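A quick illustration of this Series-level replacement on toy strings (hypothetical ticket text, with two of the patterns used above):

```python
import pandas as pd

# Series.replace(..., regex=True) substitutes matching substrings across the
# whole Series — which is why preprocess_replace takes a Series, not a str.
s = pd.Series(["connect to 10.20.30.40/24 failed", "job_1234_ab aborted"])
s = s.replace(to_replace=r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[/0-9]*', value='ipaddress ', regex=True)
s = s.replace(to_replace=r'[jJ]ob_[0-9_a-z]*', value='job_id ', regex=True)
print(s.tolist())
```

The IP address and job ID are normalized to the generic tokens `ipaddress` and `job_id`, so the vectorizer later treats all such values as one feature.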
new_data['Description'] = preprocess_replace(new_data['Description'])
new_data.head()
def is_valid_date(date_str):
    try:
        parser.parse(date_str)
        return True
    except Exception:
        return False
def clean_data(text):
    text = text.lower()
    # Drop tokens that parse as dates
    text = ' '.join([w for w in text.split() if not is_valid_date(w)])
    text = Remove_Disclaimer(text)
    # Remove email header keywords
    text = re.sub(r"received from:", ' ', text)
    text = re.sub(r"from:", ' ', text)
    text = re.sub(r"to:", ' ', text)
    text = re.sub(r"subject:", ' ', text)
    text = re.sub(r"sent:", ' ', text)
    text = re.sub(r"ic:", ' ', text)
    text = re.sub(r"cc:", ' ', text)
    text = re.sub(r"bcc:", ' ', text)
    # Remove email addresses
    text = re.sub(r'\S*@\S*\s?', '', text)
    # Remove numbers
    text = re.sub(r'\d+', ' ', text)
    # Remove new line characters
    text = re.sub(r'\n', ' ', text)
    # Remove hashtag symbol while keeping the hashtag text
    text = re.sub(r'#', ' ', text)
    # Replace '&' with 'and'
    text = re.sub(r'&;?', 'and', text)
    # Remove HTML special entities (e.g. &amp;)
    text = re.sub(r'\&\w*;', '', text)
    # Remove hyperlinks
    text = re.sub(r'https?:\/\/.*\/\w*', '', text)
    # Remove characters beyond the Basic Multilingual Plane
    text = ''.join(c for c in text if c <= '\uFFFF')
    text = text.strip()
    # Remove unreadable characters (also extra spaces)
    text = ' '.join(re.sub("[^\u0030-\u0039\u0041-\u005a\u0061-\u007a]", " ", text).split())
    # Remove caller names; lowercase them first, since the text is already lowercase
    for name in callers:
        for namepart in str(name).lower().split():
            text = text.replace(namepart, '')
    # Remove stray single characters and collapse whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()
#Cleaning the data and applying the regular expression rules
new_data['Description'] = new_data['Description'].apply(clean_data)
new_data.head()
new_data.to_excel(project_path + "CleanedData.xlsx")
As you can see, the unwanted characters/words have been removed from the description column.
As part of NLP data processing, it is important to perform lemmatization, tokenization, and stop-word removal.
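Before the NLTK-based version below, here is a dependency-free sketch of the tokenize → remove-stopwords → lemmatize pipeline. The tiny stopword list and the naive plural-stripping "lemmatizer" are stand-ins for NLTK's resources, used only to illustrate the flow:

```python
# Toy stand-ins: NLTK's stopword corpus and WordNetLemmatizer are far richer.
TOY_STOPWORDS = {"the", "is", "a", "to", "and", "are"}

def naive_lemma(word):
    # Crude stand-in for lemmatization: strip a trailing plural 's'.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def preprocess(sentence):
    tokens = sentence.lower().split()                       # 1. tokenize
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]  # 2. drop stopwords
    return [naive_lemma(t) for t in tokens]                 # 3. lemmatize

print(preprocess("The printers and monitors are down"))  # ['printer', 'monitor', 'down']
```

The real pipeline below does the same three steps with `nltk.word_tokenize`, the English stopword corpus, and POS-aware WordNet lemmatization.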
lemmatizer = WordNetLemmatizer()
# function to convert an nltk POS tag to a wordnet tag
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Function to join a list of words into a string
def listToString(s):
    return " ".join(s)
def lemmatize_sentence(sentence):
    #tokenize the sentence and find the POS tag for each token
    nltk_tokenized = nltk.word_tokenize(sentence)
    seen_tokens = set()
    result = []
    #remove duplicate words in the sentence (preserving their original order)
    for word in nltk_tokenized:
        if word not in seen_tokens:
            seen_tokens.add(word)
            result.append(word)
    new_desc = listToString(result)
    #POS-tag the ordered token list (tagging the set would lose word order)
    nltk_tagged = nltk.pos_tag(result)
    #tuples of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            #if there is no available tag, append the token as is
            lemmatized_sentence.append(word)
        else:
            #else use the tag to lemmatize the token
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence), new_desc
temp = []
temp1 = []
for sentence in new_data["Description"]:
    sentence = sentence.lower()
    sentence = re.sub(r'<.*?>', ' ', sentence)  # Remove HTML tags
    sentence = re.sub(r'\S+@\S+', 'EmailId', sentence)
    sentence = re.sub(r'\'', '', sentence)
    sentence = re.sub(r'[0-9]', '', sentence)
    sentence = re.sub(r'[^a-zA-Z0-9\s]', ' ', sentence)
    # Remove greetings and sign-off words; word boundaries (\b) prevent clipping
    # words like 'this' when removing 'hi'. (Note: re.sub takes flags via the
    # flags= keyword — its 4th positional argument is count, not flags.)
    sentence = re.sub(r'\bcom\b', ' ', sentence)
    sentence = re.sub(r'hello team', ' ', sentence)
    sentence = re.sub(r'\bhello\b', ' ', sentence)
    sentence = re.sub(r'hi team', ' ', sentence)
    sentence = re.sub(r'\bhi\b', ' ', sentence)
    sentence = re.sub(r'\bbest\b', ' ', sentence)
    sentence = re.sub(r'\bkind\b', ' ', sentence)
    sentence = re.sub(r'hello helpdesk', ' ', sentence)
    sentence = re.sub(r'good morning ', ' ', sentence)
    sentence = re.sub(r'good afternoon ', ' ', sentence)
    sentence = re.sub(r'good evening ', ' ', sentence)
    l_sentence, new_desc = lemmatize_sentence(sentence)
    words = [word for word in l_sentence.split() if word not in stop]
    descWords = [word for word in new_desc.split() if word not in stop]
    temp.append(words)
    temp1.append(listToString(descWords))
#Add the corrected description and bag of words in the data frame
new_data['BagOfWords'] = temp
new_data['NewDescription'] = temp1
new_data.head()
#Skip if any description is null or empty
final_data = new_data[new_data['NewDescription'].map(lambda d: len(d)) > 0]
final_data.shape
print(pd.Series({c: final_data[c].map(lambda x: len(str(x))).max() for c in final_data}).sort_values(ascending =False))
The max length for description after all the clean up is 3249 characters.
# calculate the length(number of characters) and number of words in every record and add it to the dataframe
final_data['length']=[len(text) for text in final_data['NewDescription']]
final_data['num_words'] = final_data['NewDescription'].apply(lambda x : len(x.split()))
final_data.head()
final_data.shape
#Copy the data and keep only the records that have more than 3 characters in the final data
final_data1 = final_data.copy()
final_data1=final_data1[final_data1['length']>=3]
final_data1.head()
final_data1.drop(['Description'], axis=1,inplace=True)
final_data1.rename(columns = {'NewDescription':'Description'}, inplace = True)
final_data1 = final_data1[['Description','BagOfWords','length','num_words','AssignmentGroup']]
final_data1.head()
final_data1.shape
final_data1.describe().transpose()
final_data1['bins'] = pd.cut(final_data1.num_words, bins=[0, 30, 50, 100, 300, np.inf], labels=['0-30', '30-50', '50-100', '100-300', '>300'])
word_distribution = final_data1.groupby('bins').size().reset_index().rename(columns={0:'counts'})
word_distribution
As seen above, the max character length for the description column is 3249 and the max number of words is 426. No record has more than 500 words. Almost 98% of the records contain 1-30 words, which means most customers gave only a short description of their issue.
sns.barplot(x='bins', y='counts', data=word_distribution).set_title("Word distribution per bin")
final_data1.to_excel(project_path + "FinalData.xlsx")
We combine the groups that have fewer than 10 samples into a single group called least_data_grp. As seen in the EDA, there are 25 groups with fewer than 10 samples each, which will not help the predictions; combining them into one group helps to categorize/classify the tickets correctly.
We will see how the two datasets (1. data without grouping, 2. data with grouping) differ in performance during modeling.
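The regrouping idea can be sketched on a toy Series (hypothetical group labels, and a threshold of 3 instead of the project's 10):

```python
import pandas as pd

# Collapse rare target classes (here: fewer than 3 samples) into one bucket,
# mirroring the least_data_grp regrouping done on the real data below.
s = pd.Series(["GRP_0"] * 5 + ["GRP_8"] * 3 + ["GRP_70", "GRP_71"])
counts = s.value_counts()
rare = counts[counts < 3].index
s = s.apply(lambda g: "least_data_grp" if g in rare else g)
print(s.value_counts().to_dict())  # {'GRP_0': 5, 'GRP_8': 3, 'least_data_grp': 2}
```

Classes with too few samples cannot be learned reliably, so merging them trades per-group granularity for a better-behaved target distribution.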
#Take a copy
Final_Data_Grouped = final_data1.copy()
AssignmentGrp = Final_Data_Grouped.groupby(['AssignmentGroup'])
LeastDataGroup = []
for grp in Final_Data_Grouped['AssignmentGroup'].unique():
    if AssignmentGrp.get_group(grp).shape[0] < 10:
        LeastDataGroup.append(grp)
print('Number of groups that has less than 10 samples: ', len(LeastDataGroup))
Final_Data_Grouped['AssignmentGroup']=Final_Data_Grouped['AssignmentGroup'].apply(lambda x : 'least_data_grp' if x in LeastDataGroup else x)
Final_Data_Grouped['AssignmentGroup'].nunique()
Final_Data_Grouped['AssignmentGroup'].unique()
As you can see, the number of unique Assignment Groups is reduced from 74 to 50. The regrouping is done because groups with very few samples (some with only 1 or 2 records) would not support proper classification and could hurt accuracy, so we merge such data into one least-data/miscellaneous group.
We will build bigram and trigram models to see how words co-occur in the given data.
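gensim's `Phrases` promotes frequently co-occurring adjacent word pairs into single bigram tokens. The counting idea behind it can be sketched without gensim, using a plain `Counter` over adjacent pairs (toy documents, not the project data):

```python
from collections import Counter

# Three toy tokenized documents.
docs = [
    ["password", "reset", "request"],
    ["password", "reset", "failed"],
    ["vpn", "connection", "failed"],
]
pair_counts = Counter()
for doc in docs:
    pair_counts.update(zip(doc, doc[1:]))  # count adjacent word pairs
print(pair_counts.most_common(1))  # [(('password', 'reset'), 2)]
```

`Phrases` additionally applies `min_count` and a significance `threshold` so that only statistically strong pairs (like `password_reset` here) are merged into one token.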
#Split the given sentences into individual words
def SplitSentenceToWords(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation
#Collect the bag of words
bag_of_words = list(SplitSentenceToWords(final_data1['Description']))
print(bag_of_words[1])
# define the bigram and trigram models (higher threshold is used to have fewer phrases)
bigram = gensim.models.Phrases(bag_of_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[bag_of_words], threshold=100)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_model = gensim.models.phrases.Phraser(bigram)
trigram_model = gensim.models.phrases.Phraser(trigram)
#Build bigram and trigram representations of the given data
def Build_Bigrams(texts):
    return [bigram_model[doc] for doc in texts]

def Build_Trigrams(texts):
    return [trigram_model[doc] for doc in texts]
# Form bigrams
data_words_bigrams = Build_Bigrams(bag_of_words)
print(data_words_bigrams[:1])  # print a sample document rather than the full list
#Form trigrams
data_words_trigrams = Build_Trigrams(data_words_bigrams)
print(data_words_trigrams[:1])  # print a sample document rather than the full list
Sample bigram and trigram data can be seen above.
A word cloud gives a very good representation of the most frequent co-occurring words and guides the actions to be taken accordingly.
#Build word cloud for the bigram model
wordclouds = ' '.join(' '.join(doc) for doc in data_words_bigrams)  # join tokens, not list reprs
wordcloud = WordCloud(width=480, height=480, max_font_size=20, min_font_size=10).generate(wordclouds)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
wordcloud_trigram = ' '.join(' '.join(doc) for doc in data_words_trigrams)  # join tokens, not list reprs
wordcloud = WordCloud(width=480, height=480, max_words=100).generate(wordcloud_trigram)
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()
Let's look at the word cloud for the different groups. This gives a good view of the most frequently used words in every group and helps with classification.
cloud_stopwords = set(STOPWORDS)  # renamed to avoid shadowing nltk's stopwords module

## function to create a Word Cloud
def show_wordcloud(data, title):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=cloud_stopwords,
        max_words=100,
        max_font_size=40,
        scale=3,
        random_state=1
    ).generate(str(data))
    fig = plt.figure(1, figsize=(10, 5))
    plt.axis('off')
    if title:
        fig.suptitle("Top 100 words of {}".format(title), fontsize=50, color='blue', fontweight='bold')
    plt.imshow(wordcloud)
    plt.show()
#Sorting based on frequency of target class Assignment group
value = final_data1['AssignmentGroup'].value_counts().sort_values(ascending=False).index
value
print(len(value))
In total there are 74 groups; we will look at the top 100 words of the top 3 groups.
group = ['GRP_0','GRP_8', 'GRP_24' ]
for i in range(len(group)):
    CloudGrp = final_data1[final_data1['AssignmentGroup'] == group[i]]
    CloudGrp = CloudGrp['BagOfWords']
    show_wordcloud(CloudGrp, group[i])
Observations:
The word clouds above show the 100 most frequent words for each of the top 3 groups.
# Create Dictionary
id2word = corpora.Dictionary(final_data1['BagOfWords'])
# Create Corpus from post clean data
texts = final_data1['BagOfWords']
# Term Document Frequency and Bag of words
corpus = [id2word.doc2bow(text) for text in texts]
# View as ID
print(corpus[:1])
# View as word
print([[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]])
For our modeling and train_test_split, we will use both datasets (1. the ungrouped data, 2. the grouped data) and see how the performance/accuracy of the models differs and which works better.
#save the 2 different data frames in excel file.
final_data1.to_excel(project_path + "FinalDataUngrouped.xlsx")
Final_Data_Grouped.to_excel(project_path + "FinalDataGrouped.xlsx")
# Split into Test, train, validation for Ungrouped data
train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20
final_data1.head()
X = final_data1['Description']
y = final_data1['AssignmentGroup']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
# Split into Test, train, validation for Grouped data
train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20
Final_Data_Grouped.head()
X_Grped = Final_Data_Grouped['Description']
Y_Grped = Final_Data_Grouped['AssignmentGroup']
X_train_Grped, X_test_Grped, y_train_Grped, y_test_Grped = train_test_split(X_Grped, Y_Grped, test_size=1 - train_ratio)
X_val_Grped, X_test_Grped, y_val_Grped, y_test_Grped = train_test_split(X_test_Grped, y_test_Grped, test_size=test_ratio/(test_ratio + validation_ratio))
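The two-stage split above yields roughly 60/20/20 because the second call takes test_ratio / (test_ratio + validation_ratio) = 0.5 of the 40% hold-out. A quick sanity check of that arithmetic on toy data (100 hypothetical samples):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]
# Stage 1: hold out 40% (1 - train_ratio).
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
# Stage 2: split the hold-out in half -> 20% validation, 20% test.
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```

With such a skewed target, passing `stratify=y` to both calls would also be worth considering so every split sees the rare groups.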
In this section, we will use some traditional machine learning models and see what results are achieved for both of the data approaches we are using.
We will use: Logistic Regression, Support Vector Classifier (SVC), Decision Tree, Random Forest, and AdaBoost.
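All five classical models below share the same pattern: a word-count vectorizer, a TF-IDF weighting step, and a classifier, chained in one sklearn Pipeline. A minimal sketch on toy tickets (hypothetical texts and group labels; the project pipelines simply swap in different classifiers):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Toy training data: two access tickets, two infrastructure tickets.
texts = ["password reset please", "reset my password",
         "server disk full", "disk space alert"]
labels = ["GRP_ACCESS", "GRP_ACCESS", "GRP_INFRA", "GRP_INFRA"]

pipe = Pipeline([("vect", CountVectorizer()),    # text -> token counts
                 ("tfidf", TfidfTransformer()),  # counts -> TF-IDF weights
                 ("clf", LogisticRegression())]) # weights -> group label
pipe.fit(texts, labels)
print(pipe.predict(["forgot password"]))
```

Bundling the vectorizer inside the pipeline ensures the test set is transformed with the vocabulary learned from the training set, avoiding leakage.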
#Create a new dataframe to save the accuracy scores of all the models
Compare_Models = pd.DataFrame(columns=['Accuracy', 'F1 Score'])

#Common function to append all the model results for comparison
#(DataFrame.append was removed in pandas 2.0, so use .loc instead)
def AppendModelResults(IndexName, AccuracyScore, F1Score):
    Compare_Models.loc[IndexName] = [AccuracyScore, F1Score]
    display(Compare_Models)
# Model1 : Logistic regression model to build, fit and predict the target class
#build the pipeline
lr = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression()),
])
#fit the model with training data
lr.fit(X_train, y_train)
#predict the model for test data
y_pred = lr.predict(X_test)
#calculate scores and print classification report
LR_Accuracy_Score = accuracy_score(y_test,y_pred)
print('accuracy %s' % LR_Accuracy_Score)
LR_F1_Score = f1_score(y_test,y_pred, average='weighted')
print('Testing F1 score: {}'.format(LR_F1_Score))
print(classification_report(y_test, y_pred))
AppendModelResults('LR_Model', LR_Accuracy_Score, LR_F1_Score)
lr_grped = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', LogisticRegression()),
])
lr_grped.fit(X_train_Grped, y_train_Grped)
y_pred_grped = lr_grped.predict(X_test_Grped)
LR_Grp_Accuracy_Score = accuracy_score(y_test_Grped,y_pred_grped)
print('accuracy %s' % LR_Grp_Accuracy_Score)
LR_Grp_F1_Score = f1_score(y_test_Grped,y_pred_grped, average='weighted')
print('Testing F1 score: {}'.format(LR_Grp_F1_Score))
print(classification_report(y_test_Grped, y_pred_grped))
AppendModelResults('LR_Grped', LR_Grp_Accuracy_Score, LR_Grp_F1_Score)
Observations:
Logistic regression performs similarly on the grouped and ungrouped data; let's see how the other models behave as well.
#Model 2: Support Vector Classifier
svc = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SVC()),
])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
SVC_Accuracy_Score = accuracy_score(y_test, y_pred)
print('accuracy %s' % SVC_Accuracy_Score)
SVC_F1_Score = f1_score(y_test, y_pred, average='weighted')
print('Testing F1 score: {}'.format(SVC_F1_Score))
print(classification_report(y_test, y_pred))
AppendModelResults('SVC Model', SVC_Accuracy_Score, SVC_F1_Score)
svc = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', SVC()),
])
svc.fit(X_train_Grped, y_train_Grped)
y_pred_grped= svc.predict(X_test_Grped)
SVC_Grped_Accuracy_Score = accuracy_score(y_test_Grped, y_pred_grped)
print('accuracy %s' % SVC_Grped_Accuracy_Score)
SVC_Grped_F1_Score = f1_score(y_test_Grped, y_pred_grped, average='weighted')
print('Testing F1 score: {}'.format(SVC_Grped_F1_Score))
print(classification_report(y_test_Grped, y_pred_grped))
AppendModelResults('SVC_Grped',SVC_Grped_Accuracy_Score,SVC_Grped_F1_Score)
#Model 3: Decision Tree Classifier
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', DecisionTreeClassifier()),
])
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
DT_Accuracy_Score = accuracy_score(y_test, y_pred)
print('accuracy %s' % DT_Accuracy_Score)
DT_F1_Score = f1_score(y_test, y_pred, average='weighted')
print('Testing F1 score: {}'.format(DT_F1_Score))
print(classification_report(y_test, y_pred))
AppendModelResults('Decision Tree', DT_Accuracy_Score, DT_F1_Score)
#Model 3: Decision Tree Classifier (grouped data)
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', DecisionTreeClassifier()),
])
nb.fit(X_train_Grped, y_train_Grped)
y_pred_grped = nb.predict(X_test_Grped)
DT_Grped_Accuracy_Score = accuracy_score(y_test_Grped, y_pred_grped)
print('accuracy %s' % DT_Grped_Accuracy_Score)
DT_Grped_F1_Score = f1_score(y_test_Grped, y_pred_grped, average='weighted')
print('Testing F1 score: {}'.format(DT_Grped_F1_Score))
print(classification_report(y_test_Grped, y_pred_grped))
AppendModelResults('Decision Tree Grped', DT_Grped_Accuracy_Score, DT_Grped_F1_Score)
#Model 4: Random Forest Classifier
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', RandomForestClassifier()),
])
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
RF_Accuracy_Score = accuracy_score(y_test, y_pred)
print('accuracy %s' % RF_Accuracy_Score )
RF_F1_Score = f1_score(y_test, y_pred, average='weighted')
print('Testing F1 score: {}'.format(RF_F1_Score))
print(classification_report(y_test, y_pred))
AppendModelResults('Random Forest', RF_Accuracy_Score, RF_F1_Score)
#Model 4: Random Forest Classifier (grouped data)
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', RandomForestClassifier()),
])
nb.fit(X_train_Grped, y_train_Grped)
y_pred_grped = nb.predict(X_test_Grped)
RF_Grped_Accuracy_Score = accuracy_score(y_test_Grped, y_pred_grped)
print('accuracy %s' % RF_Grped_Accuracy_Score )
RF_Grped_F1_Score = f1_score(y_test_Grped, y_pred_grped, average='weighted')
print('Testing F1 score: {}'.format(RF_Grped_F1_Score))
print(classification_report(y_test_Grped, y_pred_grped))
AppendModelResults('Random Forest Grped', RF_Grped_Accuracy_Score, RF_Grped_F1_Score)
#Model 5: Adaboost Classifier
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', AdaBoostClassifier()),
])
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)
ABC_Accuracy_Score = accuracy_score(y_test, y_pred)
print('accuracy %s' % ABC_Accuracy_Score)
ABC_F1_Score = f1_score(y_test, y_pred, average='weighted')
print('Testing F1 score: {}'.format(ABC_F1_Score))
print(classification_report(y_test, y_pred))
AppendModelResults('AdaBoost', ABC_Accuracy_Score, ABC_F1_Score)
#Model 5: Adaboost Classifier (grouped data)
nb = Pipeline([('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', AdaBoostClassifier()),
])
nb.fit(X_train_Grped, y_train_Grped)
y_pred_grped = nb.predict(X_test_Grped)
ABC_Grped_Accuracy_Score = accuracy_score(y_test_Grped, y_pred_grped)
print('accuracy %s' % ABC_Grped_Accuracy_Score)
ABC_Grped_F1_Score = f1_score(y_test_Grped, y_pred_grped, average='weighted')
print('Testing F1 score: {}'.format(ABC_Grped_F1_Score))
print(classification_report(y_test_Grped, y_pred_grped))
AppendModelResults('AdaBoost Grped', ABC_Grped_Accuracy_Score, ABC_Grped_F1_Score)
Observations:
Of all the models, Decision Tree and Random Forest had the highest accuracy and F1 scores, while the AdaBoost classifier had the lowest. The remaining models performed similarly, and there is little difference between the grouped and ungrouped data, so we will use the grouped data in the upcoming models.
Setting different Parameters for the model
max_features = 9000
maxlen = 100  # maximum sequence length after padding
embedding_size = 100
Apply tokenizer on description column
tokenizer = KerasTokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(Final_Data_Grouped['Description'].values)
#Apply label encoder for assignment group
le = LabelEncoder()
Final_Data_Grouped['EncodedGroup'] = le.fit_transform(Final_Data_Grouped['AssignmentGroup'])
Final_Data_Grouped.tail()
X = tokenizer.texts_to_sequences(Final_Data_Grouped['Description'])
X = pad_sequences(X, maxlen = maxlen)
y = np.asarray(Final_Data_Grouped['EncodedGroup'])
print("Number of Samples:", len(X))
print(X[0])
print("Number of Labels: ", len(y))
print(y[0])
# Split into train, validation, and test sets (60/20/20)
train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
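The two-stage split arithmetic above (hold out 40%, then split that 40% in half) can be confirmed to yield a 60/20/20 partition on toy data; this is purely illustrative and uses made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the padded sequences and encoded labels.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

# Stage 1: hold out 40% (= 1 - train_ratio).
xtr, xtmp, ytr, ytmp = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)
# Stage 2: split the held-out 40% evenly, i.e. test_ratio / (test_ratio + validation_ratio) = 0.5.
xval, xte, yval, yte = train_test_split(xtmp, ytmp, test_size=0.5, random_state=0)

print(len(xtr), len(xval), len(xte))  # 60 20 20
```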
#print the corresponding shapes
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_val.shape)
print(y_val.shape)
print(x_train)
print(y_train)
tokenizer.word_index.items()
#define the vocab size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
#Create word2vec embedding and save the vector values in a text file
sentences = [line.split(' ') for line in Final_Data_Grouped['Description']]
word2vec = Word2Vec(sentences=sentences,min_count=1)
word2vec.wv.save_word2vec_format(project_path+ 'word2vec_vector.txt')
# load the whole embedding into memory
embeddings1 = dict()
f = open(project_path+'word2vec_vector.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings1[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings1))
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings1.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
len(embeddings1.values())
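A useful sanity check on the embedding-matrix construction above is to measure how many vocabulary words actually received pretrained vectors (the rest stay all-zero). The sketch below is self-contained with a made-up vocabulary and vectors; it is not the notebook's real data:

```python
import numpy as np

# Hypothetical vocabulary (word -> index, index 0 reserved for padding)
# and a pretrained-vector dict that is missing one word.
word_index = {'reset': 1, 'password': 2, 'frobnicate': 3}
embeddings_demo = {'reset': np.ones(4), 'password': np.full(4, 0.5)}

# Same fill pattern as the notebook: unknown words keep zero rows.
matrix = np.zeros((len(word_index) + 1, 4))
for word, i in word_index.items():
    vec = embeddings_demo.get(word)
    if vec is not None:
        matrix[i] = vec

# Rows with any non-zero entry correspond to covered words.
covered = int(np.count_nonzero(matrix.any(axis=1)))
print(covered, "of", len(word_index), "words have pretrained vectors")
```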
glove_file = project_path + "glove.6B.zip"
EMBEDDING_FILE = project_path + 'glove.6B.100d.txt'
embeddings = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    embd = np.asarray(o.split(" ")[1:], dtype='float32')
    embeddings[word] = embd
EMBEDDING_FILE = project_path + 'glove.6B.200d.txt'
embeddings_200 = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    embd = np.asarray(o.split(" ")[1:], dtype='float32')
    embeddings_200[word] = embd
EMBEDDING_FILE = project_path + 'glove.6B.300d.txt'
embeddings_300 = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    embd = np.asarray(o.split(" ")[1:], dtype='float32')
    embeddings_300[word] = embd
EMBEDDING_FILE = project_path + 'glove.6B.50d.txt'
embeddings_50 = {}
for o in open(EMBEDDING_FILE):
    word = o.split(" ")[0]
    embd = np.asarray(o.split(" ")[1:], dtype='float32')
    embeddings_50[word] = embd
embedding_matrix_glove = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings.get(word)
    if embedding_vector is not None:
        embedding_matrix_glove[i] = embedding_vector
embedding_matrix_glove.shape
embedding_matrix_glove_200 = np.zeros((vocab_size, 200))
for word, i in tokenizer.word_index.items():
    embedding_vector_200 = embeddings_200.get(word)
    if embedding_vector_200 is not None:
        embedding_matrix_glove_200[i] = embedding_vector_200
embedding_matrix_glove_200.shape
embedding_matrix_glove_300 = np.zeros((vocab_size, 300))
for word, i in tokenizer.word_index.items():
    embedding_vector_300 = embeddings_300.get(word)
    if embedding_vector_300 is not None:
        embedding_matrix_glove_300[i] = embedding_vector_300
embedding_matrix_glove_300.shape
embedding_matrix_glove_50 = np.zeros((vocab_size, 50))
for word, i in tokenizer.word_index.items():
    embedding_vector_50 = embeddings_50.get(word)
    if embedding_vector_50 is not None:
        embedding_matrix_glove_50[i] = embedding_vector_50
embedding_matrix_glove_50.shape
#Common function to calculate accuracy and F1 score for test data
def calculate_Accuracy_F1_Score(y_test, y_pred):
    mat_test = confusion_matrix(y_test, y_pred)
    draw_cm(mat_test)
    report = classification_report(y_test, y_pred)
    accuracyScore = accuracy_score(y_test, y_pred)
    f1Score = f1_score(y_test, y_pred, average='weighted')
    return report, accuracyScore, f1Score
#Function to plot the confusion matrix
def draw_cm(conf_matrix):
    plt.clf()
    # Use the label encoder's class order so tick labels match the
    # row/column order of the confusion matrix (sorted label order).
    CategoryNames = le.classes_
    xytics = np.arange(len(CategoryNames))
    plt.figure(figsize=(30, 30))
    plt.imshow(conf_matrix, interpolation='nearest', cmap=plt.cm.Oranges)
    for i in range(conf_matrix.shape[0]):
        for j in range(conf_matrix.shape[1]):
            plt.text(j, i, format(conf_matrix[i, j]), ha='center', va='center')
    plt.title('Confusion Matrix - Visualization')
    plt.xticks(xytics, labels=CategoryNames, rotation=90)
    plt.yticks(xytics, labels=CategoryNames)
    plt.ylabel('True')
    plt.xlabel('Predicted')
    plt.tight_layout()
    plt.colorbar()
    plt.show()
# Visualize history
# Plot history: Validation and Training loss
def plot_loss(Model_history):
    plt.clf()
    plt.plot(Model_history.history['val_loss'], label='Validation Loss')
    plt.plot(Model_history.history['loss'], label='Training Loss')
    plt.title('Validation & Training loss history')
    plt.ylabel('Loss value')
    plt.xlabel('No. epoch')
    plt.legend()
    plt.show()
# Plot history: Validation and Training Accuracy
def plot_accuracy(Model_history):
    plt.clf()
    plt.plot(Model_history.history['val_acc'], label='Validation Accuracy')
    plt.plot(Model_history.history['acc'], label='Training Accuracy')
    plt.title('Validation & Training accuracy history')
    plt.ylabel('Accuracy value (%)')
    plt.xlabel('No. epoch')
    plt.legend()
    plt.show()
# Define the Keras model
model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=maxlen))
model.add(Dropout(0.50))
model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu'))
model.add(Dropout(0.50))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dropout(0.50))
model.add(Dense(50, activation='softmax'))
model.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
model.summary()
Batch_size = 100
Epochs = 5
Model_history = model.fit(x_train, y_train, batch_size = Batch_size, validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Model_history)
plot_accuracy(Model_history)
# Test the model after training
test_results = model.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = model.predict_classes(x_test)
report, Seq_NLP_Accuracy_Score, Seq_NLP_F1_Score = calculate_Accuracy_F1_Score(y_test,y_pred)
print("Accuracy Score: ", Seq_NLP_Accuracy_Score)
print("F1 Score: ", Seq_NLP_F1_Score)
print("Classification Report: ")
print(report)
AppendModelResults('Sequential NLP', Seq_NLP_Accuracy_Score, Seq_NLP_F1_Score)
lstm_model = Sequential()
#Embedding layer
lstm_model.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix]))
lstm_model.add(LSTM(units=128))
lstm_model.add(Flatten())
lstm_model.add(Dropout(0.50))
lstm_model.add(Dense(50, activation='relu'))
lstm_model.add(Flatten())
lstm_model.add(Dropout(0.50))
lstm_model.add(Dense(50, activation='softmax'))
lstm_model.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
lstm_model.summary()
Batch_size = 100
Epochs = 5
Lstm_Model_history = lstm_model.fit(x_train, y_train, batch_size = Batch_size, validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Lstm_Model_history)
plot_accuracy(Lstm_Model_history)
# Test the model after training
test_results = lstm_model.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = lstm_model.predict_classes(x_test)
lstm_report, LSTM_Accuracy_Score,LSTM_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", LSTM_Accuracy_Score)
print("F1 Score: ", LSTM_F1_Score)
print(lstm_report)
AppendModelResults('LSTM with Word2Vec', LSTM_Accuracy_Score, LSTM_F1_Score)
lstm_model_glove = Sequential()
#Embedding layer
lstm_model_glove.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix_glove]))
lstm_model_glove.add(LSTM(units=128))
lstm_model_glove.add(Flatten())
lstm_model_glove.add(Dropout(0.50))
lstm_model_glove.add(Dense(50, activation='tanh'))
lstm_model_glove.add(Flatten())
lstm_model_glove.add(Dropout(0.50))
lstm_model_glove.add(Dense(50, activation='softmax'))
lstm_model_glove.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
lstm_model_glove.summary()
Batch_size = 100
Epochs = 5
Lstm_glove_Model_history = lstm_model_glove.fit(x_train, y_train, batch_size = Batch_size, validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Lstm_glove_Model_history)
plot_accuracy(Lstm_glove_Model_history)
# Test the model after training
test_results = lstm_model_glove.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = lstm_model_glove.predict_classes(x_test)
lstm_glove_report, LSTM_Glove_Accuracy_Score,LSTM_Glove_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", LSTM_Glove_Accuracy_Score)
print("F1 Score: ", LSTM_Glove_F1_Score)
print(lstm_glove_report)
AppendModelResults('LSTM with Glove',LSTM_Glove_Accuracy_Score, LSTM_Glove_F1_Score)
Observations:
BiDir_lstm_model_glove = Sequential()
#Embedding layer
BiDir_lstm_model_glove.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix_glove]))
BiDir_lstm_model_glove.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove.add(Flatten())
BiDir_lstm_model_glove.add(Dense(100, activation='tanh'))
BiDir_lstm_model_glove.add(Flatten())
BiDir_lstm_model_glove.add(Dropout(0.50))
BiDir_lstm_model_glove.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove.summary()
Batch_size = 100
Epochs = 5
Bidir_Lstm_glove_Model_history = BiDir_lstm_model_glove.fit(x_train, y_train, batch_size = Batch_size, validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_history)
plot_accuracy(Bidir_Lstm_glove_Model_history)
# Test the model after training
test_results = BiDir_lstm_model_glove.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove.predict_classes(x_test)
BiDir_lstm_glove_report, BiDir_LSTM_Glove_Accuracy_Score,BiDir_LSTM_Glove_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove_F1_Score)
print(BiDir_lstm_glove_report)
AppendModelResults('BiDirectional LSTM with Glove', BiDir_LSTM_Glove_Accuracy_Score, BiDir_LSTM_Glove_F1_Score)
Observations:
Let's tune some of the hyperparameters of the bidirectional LSTM to see whether the model's performance improves. The hyperparameters to be tuned are the learning rate, the embedding size, and the maximum sequence length (maxlen):
LearningRate = 0.01
LearningRate1 = 0.001
LearningRate2 = 0.0001
embedding_size = 100
embedding_size1 = 50
embedding_size2 = 200
embedding_size3 = 300
maxlen = 100
maxlen1 = 50
maxlen2 = 200
maxlen3 = 300
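The parallel `LearningRate*`, `embedding_size*`, and `maxlen*` variables above can alternatively be organized as a single grid, so every combination is enumerated rather than maintained by hand. A minimal sketch (the grid itself is illustrative; the notebook only tries a handful of these combinations):

```python
from itertools import product

# Candidate values mirroring the assignments above.
learning_rates = [0.01, 0.001, 0.0001]
embedding_sizes = [50, 100, 200, 300]
maxlens = [50, 100, 200, 300]

# Cartesian product: one tuple per (lr, embedding size, maxlen) combination.
grid = list(product(learning_rates, embedding_sizes, maxlens))
print(len(grid))  # 3 * 4 * 4 = 48 combinations

# Each tuple could then drive one training run.
for lr, emb, ml in grid[:2]:
    print(lr, emb, ml)
```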
sequences = tokenizer.texts_to_sequences(Final_Data_Grouped['Description'])
#padding sequences based on different maxlen values (hyperparameter);
#pad from the raw sequences each time, not from an already-padded array
X = pad_sequences(sequences, maxlen=maxlen)
X1 = pad_sequences(sequences, maxlen=maxlen1)
X2 = pad_sequences(sequences, maxlen=maxlen2)
X3 = pad_sequences(sequences, maxlen=maxlen3)
y = np.asarray(Final_Data_Grouped['EncodedGroup'])
# Split into Test, train, validation for grouped data with 100 as the maxlen
train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio, random_state=0)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
#print the corresponding shapes
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
print(x_val.shape)
print(y_val.shape)
#train_test split with max_len=50
x_train1, x_test1, y_train1, y_test1 = train_test_split(X1, y, test_size=1 - train_ratio, random_state=0)
x_val1, x_test1, y_val1, y_test1 = train_test_split(x_test1, y_test1, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
#print the corresponding shapes
print(x_train1.shape)
print(y_train1.shape)
print(x_test1.shape)
print(y_test1.shape)
print(x_val1.shape)
print(y_val1.shape)
#train test split with maxlen = 200
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=1 - train_ratio, random_state=0)
x_val2, x_test2, y_val2, y_test2 = train_test_split(x_test2, y_test2, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
#print the corresponding shapes
print(x_train2.shape)
print(y_train2.shape)
print(x_test2.shape)
print(y_test2.shape)
print(x_val2.shape)
print(y_val2.shape)
#train test split with maxlen = 300
x_train3, x_test3, y_train3, y_test3 = train_test_split(X3, y, test_size=1 - train_ratio, random_state=0)
x_val3, x_test3, y_val3, y_test3 = train_test_split(x_test3, y_test3, test_size=test_ratio/(test_ratio + validation_ratio), random_state=0)
#print the corresponding shapes
print(x_train3.shape)
print(y_train3.shape)
print(x_test3.shape)
print(y_test3.shape)
print(x_val3.shape)
print(y_val3.shape)
tokenizer.word_index.items()
#define the vocab size
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
In this model, we have used an embedding size of 50, a maxlen of 50, and the corresponding 50-dimensional GloVe weights.
BiDir_lstm_model_glove_50 = Sequential()
#Embedding layer
BiDir_lstm_model_glove_50.add(Embedding(vocab_size, embedding_size1, input_length=maxlen1, weights=[embedding_matrix_glove_50]))
BiDir_lstm_model_glove_50.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove_50.add(Flatten())
BiDir_lstm_model_glove_50.add(Dense(50, activation='tanh'))
BiDir_lstm_model_glove_50.add(Flatten())
BiDir_lstm_model_glove_50.add(Dropout(0.50))
BiDir_lstm_model_glove_50.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove_50.compile(optimizer=Adam(lr = LearningRate), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove_50.summary()
The batch size is set to 50 and the number of epochs to 5.
Batch_size = 50
Epochs = 5
Bidir_Lstm_glove_Model_50_history = BiDir_lstm_model_glove_50.fit(x_train1, y_train1, batch_size = Batch_size, validation_data = (x_val1,y_val1), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_50_history)
plot_accuracy(Bidir_Lstm_glove_Model_50_history)
# Test the model after training
test_results = BiDir_lstm_model_glove_50.evaluate(x_test1, y_test1, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove_50.predict_classes(x_test1)
BiDir_lstm_glove_50_report, BiDir_LSTM_Glove_50_Accuracy_Score,BiDir_LSTM_Glove_50_F1_Score = calculate_Accuracy_F1_Score(y_test1, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove_50_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove_50_F1_Score)
print(BiDir_lstm_glove_50_report)
AppendModelResults('BiDirectional LSTM with 50 Dimensions', BiDir_LSTM_Glove_50_Accuracy_Score, BiDir_LSTM_Glove_50_F1_Score)
Observations:
Now let's try 200 dimensions: the embedding size, maxlen, and weights are all defined with 200 dimensions, and the flatten, dropout, and additional activation layers have been removed. The learning rate used here is 0.001.
BiDir_lstm_model_glove_200 = Sequential()
#Embedding layer
BiDir_lstm_model_glove_200.add(Embedding(vocab_size, embedding_size2, input_length=maxlen2, weights=[embedding_matrix_glove_200]))
BiDir_lstm_model_glove_200.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove_200.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove_200.compile(optimizer=Adam(lr = LearningRate1), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove_200.summary()
We use a batch size of 200, still with 5 epochs, along with callbacks: model checkpointing, early stopping, and learning-rate reduction on plateau.
Batch_size = 200
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True, verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_200_history = BiDir_lstm_model_glove_200.fit(x_train2, y_train2, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr],validation_data = (x_val2,y_val2), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_200_history)
plot_accuracy(Bidir_Lstm_glove_Model_200_history)
# Test the model after training
test_results = BiDir_lstm_model_glove_200.evaluate(x_test2, y_test2, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove_200.predict_classes(x_test2)
BiDir_lstm_glove_200_report, BiDir_LSTM_Glove_200_Accuracy_Score,BiDir_LSTM_Glove_200_F1_Score = calculate_Accuracy_F1_Score(y_test2, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove_200_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove_200_F1_Score)
print(BiDir_lstm_glove_200_report)
AppendModelResults('Bidirectional LSTM with 200 Dimensions', BiDir_LSTM_Glove_200_Accuracy_Score, BiDir_LSTM_Glove_200_F1_Score)
Observations:
BiDir_lstm_model_glove_300 = Sequential()
#Embedding layer
BiDir_lstm_model_glove_300.add(Embedding(vocab_size, embedding_size3, input_length=maxlen3, weights=[embedding_matrix_glove_300]))
BiDir_lstm_model_glove_300.add(Bidirectional(LSTM(units=256,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove_300.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove_300.compile(optimizer=Adam(lr = LearningRate1), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove_300.summary()
The same callbacks are used again, this time with a batch size of 300:
Batch_size = 300
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True, verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_300_history = BiDir_lstm_model_glove_300.fit(x_train3, y_train3, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr],validation_data = (x_val3,y_val3), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_300_history)
plot_accuracy(Bidir_Lstm_glove_Model_300_history)
# Test the model after training
test_results = BiDir_lstm_model_glove_300.evaluate(x_test3, y_test3, verbose=True)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove_300.predict_classes(x_test3)
BiDir_lstm_glove_300_report, BiDir_LSTM_Glove_300_Accuracy_Score,BiDir_LSTM_Glove_300_F1_Score = calculate_Accuracy_F1_Score(y_test3, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove_300_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove_300_F1_Score)
print(BiDir_lstm_glove_300_report)
AppendModelResults('BiDirectional LSTM with 300 Dimensions', BiDir_LSTM_Glove_300_Accuracy_Score, BiDir_LSTM_Glove_300_F1_Score)
Observations:
BiDir_lstm_model_glove1 = Sequential()
#Embedding layer
BiDir_lstm_model_glove1.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix_glove]))
BiDir_lstm_model_glove1.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove1.add(Flatten())
BiDir_lstm_model_glove1.add(Dense(50, activation='tanh'))
BiDir_lstm_model_glove1.add(Flatten())
BiDir_lstm_model_glove1.add(Dropout(0.50))
BiDir_lstm_model_glove1.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove1.compile(optimizer=Adam(lr = 0.0001), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove1.summary()
Batch Size = 100 Epochs = 10
Batch_size = 100
Epochs = 10
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_history1 = BiDir_lstm_model_glove1.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_history1)
plot_accuracy(Bidir_Lstm_glove_Model_history1)
# Test the model after training
test_results = BiDir_lstm_model_glove1.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove1.predict_classes(x_test)
BiDir_lstm_glove1_report, BiDir_LSTM_Glove1_Accuracy_Score,BiDir_LSTM_Glove1_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove1_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove1_F1_Score)
print(BiDir_lstm_glove1_report)
AppendModelResults('BiDirectional LSTM with LR 0.0001', BiDir_LSTM_Glove1_Accuracy_Score, BiDir_LSTM_Glove1_F1_Score)
Observations:
BiDir_lstm_model_glove2 = Sequential()
#Embedding layer
BiDir_lstm_model_glove2.add(Embedding(vocab_size, embedding_size, input_length = maxlen, weights=[embedding_matrix_glove]))
BiDir_lstm_model_glove2.add(Bidirectional(LSTM(units=256,recurrent_dropout=0.5,dropout=0.5)))
BiDir_lstm_model_glove2.add(Flatten())
BiDir_lstm_model_glove2.add(Dropout(0.50))
BiDir_lstm_model_glove2.add(Dense(50, activation='relu'))
BiDir_lstm_model_glove2.add(Flatten())
BiDir_lstm_model_glove2.add(Dropout(0.50))
BiDir_lstm_model_glove2.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove2.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove2.summary()
Batch_size = 100
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_history2 = BiDir_lstm_model_glove2.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_history2)
plot_accuracy(Bidir_Lstm_glove_Model_history2)
# Test the model after training
test_results = BiDir_lstm_model_glove2.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove2.predict_classes(x_test)
BiDir_lstm_glove2_report, BiDir_LSTM_Glove2_Accuracy_Score,BiDir_LSTM_Glove2_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove2_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove2_F1_Score)
print(BiDir_lstm_glove2_report)
AppendModelResults('BiDirectional LSTM with 256 units',BiDir_LSTM_Glove2_Accuracy_Score,BiDir_LSTM_Glove2_F1_Score)
Observations:
BiDir_lstm_model_glove3 = Sequential()
#Embedding layer
BiDir_lstm_model_glove3.add(Embedding(vocab_size, embedding_size, weights=[embedding_matrix_glove],trainable=True))
BiDir_lstm_model_glove3.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5, return_sequences=True)))
BiDir_lstm_model_glove3.add(LSTM(128, return_sequences=True))
BiDir_lstm_model_glove3.add(LSTM(64))
BiDir_lstm_model_glove3.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove3.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove3.summary()
Batch_size = 100
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_history3 = BiDir_lstm_model_glove3.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_history3)
plot_accuracy(Bidir_Lstm_glove_Model_history3)
# Test the model after training
test_results = BiDir_lstm_model_glove3.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove3.predict_classes(x_test)
BiDir_lstm_glove3_report, BiDir_LSTM_Glove3_Accuracy_Score,BiDir_LSTM_Glove3_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove3_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove3_F1_Score)
print(BiDir_lstm_glove3_report)
AppendModelResults('Multiple LSTM',BiDir_LSTM_Glove3_Accuracy_Score,BiDir_LSTM_Glove3_F1_Score)
Observations:
This model stacks two bidirectional LSTM layers, with the embedding layer's trainable parameter set to True and return_sequences=True on the first LSTM layer.
BiDir_lstm_model_glove_final = Sequential()
#Embedding layer
BiDir_lstm_model_glove_final.add(Embedding(vocab_size, embedding_size, input_length=maxlen, weights=[embedding_matrix_glove],trainable=True))
BiDir_lstm_model_glove_final.add(Bidirectional(LSTM(units=128,recurrent_dropout=0.5,dropout=0.5, return_sequences=True)))
BiDir_lstm_model_glove_final.add(Bidirectional(LSTM(units=128)))
BiDir_lstm_model_glove_final.add(Flatten())
BiDir_lstm_model_glove_final.add(Dense(100, activation='tanh'))
BiDir_lstm_model_glove_final.add(Flatten())
BiDir_lstm_model_glove_final.add(Dropout(0.50))
BiDir_lstm_model_glove_final.add(Dense(50, activation='softmax'))
BiDir_lstm_model_glove_final.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
BiDir_lstm_model_glove_final.summary()
Batch_size = 100
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1)
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=4, min_lr=1e-05, factor=0.1)
Bidir_Lstm_glove_Model_history_final = BiDir_lstm_model_glove_final.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(Bidir_Lstm_glove_Model_history_final)
plot_accuracy(Bidir_Lstm_glove_Model_history_final)
# Test the model after training
test_results = BiDir_lstm_model_glove_final.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = BiDir_lstm_model_glove_final.predict_classes(x_test)
BiDir_lstm_glove_final_report, BiDir_LSTM_Glove_final_Accuracy_Score,BiDir_LSTM_Glove_final_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", BiDir_LSTM_Glove_final_Accuracy_Score)
print("F1 Score: ", BiDir_LSTM_Glove_final_F1_Score)
print(BiDir_lstm_glove_final_report)
AppendModelResults('LSTM with Return Sequences',BiDir_LSTM_Glove_final_Accuracy_Score,BiDir_LSTM_Glove_final_F1_Score)
Observations:
The LSTM with return sequences performed well compared to the other LSTM variants and is the second-best model, after the base bidirectional LSTM with GloVe embeddings, in terms of F1 score and accuracy.
Even though we tuned the model with multiple hyperparameter values, performance did not improve drastically over the base model, which points to skew/imbalance in the data.
The data is highly imbalanced, with most records belonging to GRP_0, and this hurts model performance and accuracy.
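One common mitigation for this kind of imbalance, not applied in this notebook, is class weighting: rare classes contribute more to the loss. A sketch using scikit-learn, with toy labels standing in for `EncodedGroup`; the resulting dict is the shape Keras `model.fit` accepts via its `class_weight` argument:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with heavy imbalance (stand-in for EncodedGroup, which is
# dominated by GRP_0 in the real data).
y = np.array([0] * 80 + [1] * 15 + [2] * 5)

# 'balanced' weights: n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
class_weight = dict(zip(np.unique(y), weights))

# Rare classes get proportionally larger weights; this dict could be
# passed as model.fit(..., class_weight=class_weight) in Keras.
print(class_weight)
```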
RNN_model = Sequential()
#Embedding layer
RNN_model.add(Embedding(vocab_size, embedding_size, input_length=maxlen, weights=[embedding_matrix_glove],trainable=True))
RNN_model.add(Conv1D(100,10,activation='tanh'))
RNN_model.add(MaxPooling1D(pool_size=2))
RNN_model.add(Dropout(0.3))
RNN_model.add(Conv1D(100,10,activation='tanh'))
RNN_model.add(MaxPooling1D(pool_size=2))
RNN_model.add(Bidirectional(LSTM(units=128)))
RNN_model.add(Dropout(0.3))
RNN_model.add(Dense(100, activation='tanh'))
RNN_model.add(Dense(50, activation='softmax'))
RNN_model.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
RNN_model.summary()
Batch_size = 100
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1,monitor='val_acc')
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=2, min_lr=1e-04, factor=0.2, monitor='val_loss')
RNN_model_history = RNN_model.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(RNN_model_history)
plot_accuracy(RNN_model_history)
# Test the model after training
test_results = RNN_model.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = RNN_model.predict_classes(x_test)
RNN_model_report, RNN_model_Accuracy_Score,RNN_model_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", RNN_model_Accuracy_Score)
print("F1 Score: ", RNN_model_F1_Score)
print(RNN_model_report)
AppendModelResults('RNN Model', RNN_model_Accuracy_Score,RNN_model_F1_Score)
Observations:
The GRU (Gated Recurrent Unit) is a variation of the LSTM that has an update gate and a reset gate. These gates decide how much past information is carried forward: the reset gate controls how much of the previous state enters the candidate state, and the update gate interpolates between the previous state and the candidate. In this GRU model, a single GRU layer with 128 units is followed by dropout and two dense layers.
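The gate computations can be sketched in plain NumPy as a minimal single-step GRU. This follows the original paper's update convention, omits biases, and uses illustrative names and dimensions; it is not the Keras implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step (biases omitted for brevity)."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)                # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)                # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))    # candidate state
    return (1 - z) * h_prev + z * h_tilde              # interpolate old/new

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy input and hidden sizes
# Three (W, U) pairs: update gate, reset gate, candidate state.
params = [rng.standard_normal(s) for s in [(d_h, d_in), (d_h, d_h)] * 3]

h = np.zeros(d_h)
for t in range(5):  # run over a toy 5-step sequence
    h = gru_step(rng.standard_normal(d_in), h, *params)
print(h.shape)
```

Because each step is a convex combination of the previous state and a tanh candidate, the hidden state stays bounded in [-1, 1].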
GRU_model = Sequential()
#Embedding layer
GRU_model.add(Embedding(vocab_size, embedding_size, input_length=maxlen, weights=[embedding_matrix_glove],trainable=True))
GRU_model.add(GRU(units=128))
GRU_model.add(Dropout(0.3))
GRU_model.add(Dense(100, activation='tanh'))
GRU_model.add(Dense(50, activation='softmax'))
GRU_model.compile(optimizer=Adam(lr = 0.01), loss='sparse_categorical_crossentropy', metrics=['acc'])
GRU_model.summary()
Batch_size = 100
Epochs = 5
model_checkpoint = ModelCheckpoint("results_{val_loss:.2f}", save_best_only=True,verbose=1,monitor='val_acc')
early_stopping = EarlyStopping(patience=5, verbose=1)
reduce_lr = ReduceLROnPlateau(patience=2, min_lr=1e-04, factor=0.2, monitor='val_loss')
GRU_model_history = GRU_model.fit(x_train, y_train, batch_size = Batch_size, callbacks=[model_checkpoint,early_stopping,reduce_lr], validation_data = (x_val,y_val), epochs = Epochs)
plot_loss(GRU_model_history)
plot_accuracy(GRU_model_history)
# Test the model after training
test_results = GRU_model.evaluate(x_test, y_test, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
y_pred = GRU_model.predict_classes(x_test)
GRU_model_report, GRU_model_Accuracy_Score,GRU_model_F1_Score = calculate_Accuracy_F1_Score(y_test, y_pred)
print("Accuracy_Score: ", GRU_model_Accuracy_Score)
print("F1 Score: ", GRU_model_F1_Score)
print(GRU_model_report)
AppendModelResults('GRU Model',GRU_model_Accuracy_Score,GRU_model_F1_Score)
Observations:
In this section, we will look at two state-of-the-art models: ULMFiT and BERT.
Universal Language Model Fine-Tuning (ULMFiT) is a transfer learning technique which can help with various NLP tasks. ULMFiT involves 3 major stages: general-domain language model pre-training, target-task language model fine-tuning, and target-task classifier fine-tuning.
The method is universal: it works across tasks varying in document size, number, and label type; it uses a single architecture and training process; and it requires no custom feature engineering or pre-processing.
from fastai.text import *
n_epochs = 5 # how many times to iterate over all samples
n_splits = 5 # Number of K-fold Splits
SEED = 10
debug = 0
ULMData = Final_Data_Grouped.copy()
print(ULMData.shape)
ULMData.head()
ULMData = ULMData.drop(['length','num_words','bins','BagOfWords','AssignmentGroup'],axis=1)
ULMData.head()
#Split the data into Train, test and validation data frames
rng = RandomState(SEED) # use the SEED defined above for a reproducible split
train_data = ULMData.sample(frac=0.8, random_state=rng)
test_df = ULMData.loc[~ULMData.index.isin(train_data.index)]
print(train_data.shape,test_df.shape)
rng1 = RandomState(SEED)
train_df = train_data.sample(frac=0.75, random_state=rng1)
valid_df = train_data.loc[~train_data.index.isin(train_df.index)]
print(train_df.shape, valid_df.shape, test_df.shape)
# Language model data : We use test_df as validation for language model
data_lm = TextLMDataBunch.from_df(path = "",train_df= train_df ,valid_df = valid_df,test_df=test_df,text_cols='Description', label_cols='EncodedGroup')
data_lm.show_batch()
#train the model with model learner and LSTM algorithm
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
#fit the trained model
learn.fit_one_cycle(1, 1e-2)
As seen above, the accuracy is very low at 11%. Let's call the learning rate finder to find the best learning rate for the model.
#find the best learning rate
learn.lr_find()
#plot the findings
learn.recorder.plot(suggestion=True)
min_lr = learn.recorder.min_grad_lr
print(min_lr)
Observation:
Initially the loss reduces, but as shown in the graph above, it starts increasing again after 5-6 epochs. Now, let's fit the model using the learning rate at which the loss is lowest.
learn.fit_one_cycle(2, min_lr)
Observation:
With the best learning rate, the accuracy increased from 11% to 31%.
#Save the model
learn.save('fit_head')
#unfreezing the layers in the model
learn.unfreeze()
#fitting the model with 5 epochs
learn.fit_one_cycle(5, 1e-3,moms=(0.9,0.8))
#Save the model
learn.save('fine_tuned')
#Save the model encoder
learn.save_encoder('fine_tuned_enc')
learn.predict('password', n_words=10)
Observation:
Unfreezing the layers did not improve the accuracy much beyond 30%.
# Creating Classification Data
print("Creating Classification Data")
data_clas = TextClasDataBunch.from_df(path ="", train_df=train_df, valid_df =valid_df,vocab=data_lm.train_ds.vocab, bs=32,label_cols = 'EncodedGroup',text_cols='Description')
data_clas.show_batch()
print("Creating Classifier Object")
claslearn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
claslearn.load_encoder('fine_tuned_enc')
claslearn.lr_find()
claslearn.recorder.plot(suggestion=True)
min_grad_lr = claslearn.recorder.min_grad_lr
claslearn.fit_one_cycle(2, min_grad_lr)
Observation:
With the classifier model, the accuracy improved to 55%.
claslearn.recorder.plot_losses()
#unfreezing the last 2 layers alone
claslearn.freeze_to(-2)
claslearn.fit_one_cycle(5, slice(5e-3, 2e-3), moms=(0.8,0.7))
Observation:
The accuracy improved to 61% when the last 2 layers were unfrozen.
claslearn.recorder.plot_losses()
claslearn.unfreeze()
claslearn.fit_one_cycle(5, slice(2e-3/100, 2e-3), moms=(0.8,0.7))
# claslearn.predict expects a single text, so sanity-check one sample first
preds = claslearn.predict(test_df.Description.iloc[0])
print(preds)
def evaluate():
    texts = test_df['Description'].values
    labels = test_df['EncodedGroup'].values
    preds = []
    for t in texts:
        preds.append(claslearn.predict(t)[1].numpy())
    return preds, labels
preds, labels = evaluate()
ULMReport, ULMAccuracyScore, ULMF1Score = calculate_Accuracy_F1_Score(labels,preds)
print("Accuracy_Score: ", ULMAccuracyScore)
print("F1 Score: ", ULMF1Score)
print(ULMReport)
AppendModelResults('ULMFit Model',ULMAccuracyScore, ULMF1Score)
Observation:
#Exporting the Compare_Models dataframe to Excel before running the BERT model.
#Since the BERT model uses TensorFlow 1.15, uninstalling the existing TensorFlow
#version will cause all the variables to be lost, so we save the results dataframe first.
Compare_Models.to_excel(project_path+'TestResults.xlsx',index=True, index_label='Model')
BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.
!pip uninstall -y tensorflow
!pip install tensorflow-gpu==1.15.0
!pip install -q keras==2.2.4
import tensorflow as tf
device_name = tf.test.gpu_device_name()
device_name
!wget -q https://raw.githubusercontent.com/google-research/bert/master/modeling.py
!wget -q https://raw.githubusercontent.com/google-research/bert/master/optimization.py
!wget -q https://raw.githubusercontent.com/google-research/bert/master/run_classifier.py
!wget -q https://raw.githubusercontent.com/google-research/bert/master/tokenization.py
import os
import numpy as np
import pandas as pd
import datetime
import sys
import zipfile
import modeling
import optimization
import run_classifier
import tokenization
from run_classifier import FLAGS
from tokenization import FullTokenizer
from sklearn.preprocessing import LabelEncoder
import tensorflow as tf
from sklearn.model_selection import train_test_split
import keras
import tensorflow_hub as hub
from tqdm import tqdm_notebook
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
import re
from keras.layers import Layer
import warnings
import logging
print(tf.__version__)
warnings.filterwarnings("ignore")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # FATAL
logging.getLogger('tensorflow').setLevel(logging.FATAL)
sess = tf.Session()
# Params for bert model and tokenization
bert_path = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
max_seq_length = 128 # Initial value was 512
from google.colab import drive
drive.mount('/content/drive/')
#Set your project path
project_path = '/content/drive/My Drive/Colab Notebooks/Capstone Working copy/'
input_df=pd.read_excel(project_path+'FinalDataGrouped.xlsx')
input_df.head()
# Perform Label Encoding, Split into Test-train set
from sklearn.model_selection import train_test_split
train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20
label_encoder = LabelEncoder()
input_df['AssignmentGroup']= label_encoder.fit_transform(input_df['AssignmentGroup'])
X = input_df['Description']
y = input_df['AssignmentGroup']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
train_text = X_train.astype(str)
train_text = [' '.join(t.split()[0:max_seq_length]) for t in train_text]
train_text = np.array(train_text, dtype=object)[:, np.newaxis]
train_label = y_train
val_text = X_val.astype(str)
val_text = [' '.join(t.split()[0:max_seq_length]) for t in val_text]
val_text = np.array(val_text, dtype=object)[:, np.newaxis]
val_label = y_val
test_text = X_test.astype(str)
test_text = [' '.join(t.split()[0:max_seq_length]) for t in test_text]
test_text = np.array(test_text, dtype=object)[:, np.newaxis]
# Define the class for the BERT layer used in modeling
class BertLayer(Layer):
    '''BertLayer which supports the following output_representation params:
    pooled_output: the first CLS token after adding a projection layer, with shape [batch_size, 768].
    sequence_output: all tokens output with shape [batch_size, max_length, 768].
    mean_pooling: mean pooling of all tokens output [batch_size, max_length, 768].
    You can simply fine-tune the last n layers in BERT with the n_fine_tune_layers parameter.
    To view trainable parameters call model.trainable_weights after creating the model.
    '''
    def __init__(self, n_fine_tune_layers=10, tf_hub=None, output_representation='pooled_output', trainable=False, **kwargs):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.is_trainable = trainable
        self.output_size = 768
        self.tf_hub = tf_hub
        self.output_representation = output_representation
        self.supports_masking = True
        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            self.tf_hub,
            trainable=self.is_trainable,
            name="{}_module".format(self.name)
        )
        variables = list(self.bert.variable_map.values())
        if self.is_trainable:
            # First remove unused layers
            trainable_vars = [var for var in variables if not "/cls/" in var.name]
            if self.output_representation == "sequence_output" or self.output_representation == "mean_pooling":
                # Also remove unused pooler layers
                trainable_vars = [var for var in trainable_vars if not "/pooler/" in var.name]
            # Select how many layers to fine-tune
            trainable_vars = trainable_vars[-self.n_fine_tune_layers:]
            # Add to trainable weights
            for var in trainable_vars:
                self._trainable_weights.append(var)
            # Add non-trainable weights
            for var in self.bert.variables:
                if var not in self._trainable_weights:
                    self._non_trainable_weights.append(var)
        else:
            for var in variables:
                self._non_trainable_weights.append(var)
        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        result = self.bert(inputs=bert_inputs, signature="tokens", as_dict=True)
        if self.output_representation == "pooled_output":
            pooled = result["pooled_output"]
        elif self.output_representation == "mean_pooling":
            result_tmp = result["sequence_output"]
            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (
                tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result_tmp, input_mask)
        elif self.output_representation == "sequence_output":
            pooled = result["sequence_output"]
        return pooled

    def compute_mask(self, inputs, mask=None):
        if self.output_representation == 'sequence_output':
            inputs = [K.cast(x, dtype="bool") for x in inputs]
            mask = inputs[1]
            return mask
        else:
            return None

    def compute_output_shape(self, input_shape):
        if self.output_representation == "sequence_output":
            return (input_shape[0][0], input_shape[0][1], self.output_size)
        else:
            return (input_shape[0][0], self.output_size)
# Function to build the BERT model using the BertLayer class
def build_model(max_seq_length, tf_hub, n_classes, n_fine_tune):
    in_id = keras.layers.Input(shape=(max_seq_length,), name="input_ids")
    in_mask = keras.layers.Input(shape=(max_seq_length,), name="input_masks")
    in_segment = keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
    bert_inputs = [in_id, in_mask, in_segment]
    bert_output = BertLayer(n_fine_tune_layers=n_fine_tune, tf_hub=tf_hub, output_representation='mean_pooling', trainable=True)(bert_inputs)
    drop = keras.layers.Dropout(0.3)(bert_output)
    dense = keras.layers.Dense(128, activation='tanh')(drop)
    drop = keras.layers.Dropout(0.3)(dense)
    dense = keras.layers.Dense(64, activation='tanh')(drop)
    pred = keras.layers.Dense(n_classes, activation='softmax')(dense)
    model = keras.models.Model(inputs=bert_inputs, outputs=pred)
    adam = keras.optimizers.Adam(lr=0.001)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=adam, metrics=['sparse_categorical_accuracy'])
    model.summary()
    return model

def initialize_vars(sess):
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)
n_classes = len(label_encoder.classes_)
n_fine_tune_layers = 5
model = build_model(max_seq_length, bert_path, n_classes, n_fine_tune_layers)
# Instantiate variables
initialize_vars(sess)
model.trainable_weights
class PaddingInputExample(object):
    """Fake example so the number of input examples is a multiple of the batch size.
    When running eval/predict on the TPU, we need to pad the number of examples
    to be a multiple of the batch size, because the TPU requires a fixed batch
    size. The alternative is to drop the last batch, which is bad because it means
    the entire output data won't be generated.
    We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
    """

class InputExample(object):
    """A single training/test example for simple sequence classification."""
    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs an InputExample.
        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
                sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
                Only must be specified for sequence pair tasks.
            label: (Optional) string. The label of the example. This should be
                specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label
def create_tokenizer_from_hub_module(tf_hub):
    """Get the vocab file and casing info from the Hub module."""
    bert_module = hub.Module(tf_hub)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    vocab_file, do_lower_case = sess.run(
        [
            tokenization_info["vocab_file"],
            tokenization_info["do_lower_case"],
        ]
    )
    return FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)
def convert_single_example(tokenizer, example, max_seq_length=256):
    """Converts a single `InputExample` into a single `InputFeatures`."""
    if isinstance(example, PaddingInputExample):
        input_ids = [0] * max_seq_length
        input_mask = [0] * max_seq_length
        segment_ids = [0] * max_seq_length
        label = 0
        return input_ids, input_mask, segment_ids, label
    tokens_a = tokenizer.tokenize(example.text_a)
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0:(max_seq_length - 2)]
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)
    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    return input_ids, input_mask, segment_ids, example.label
def convert_examples_to_features(tokenizer, examples, max_seq_length=256):
    """Convert a set of `InputExample`s to a list of `InputFeatures`."""
    input_ids, input_masks, segment_ids, labels = [], [], [], []
    for example in tqdm_notebook(examples, desc="Converting examples to features"):
        input_id, input_mask, segment_id, label = convert_single_example(
            tokenizer, example, max_seq_length
        )
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
        labels.append(label)
    return (
        np.array(input_ids),
        np.array(input_masks),
        np.array(segment_ids),
        np.array(labels).reshape(-1, 1),
    )

def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(guid=None, text_a=" ".join(text), text_b=None, label=label)
        )
    return InputExamples
# Instantiate tokenizer
tokenizer = create_tokenizer_from_hub_module(bert_path)
# Convert data to InputExample format
train_examples = convert_text_to_examples(train_text, train_label)
val_examples = convert_text_to_examples(val_text, val_label)
# Convert to features
(train_input_ids, train_input_masks, train_segment_ids, train_labels
) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_length)
(val_input_ids, val_input_masks, val_segment_ids, val_labels
) = convert_examples_to_features(tokenizer, val_examples, max_seq_length=max_seq_length)
from keras.callbacks import EarlyStopping
BATCH_SIZE = 64 # Reduced from 256; smaller batch sizes improved accuracy
MONITOR = 'val_sparse_categorical_accuracy'
print('BATCH_SIZE is {}'.format(BATCH_SIZE))
e_stopping = EarlyStopping(monitor=MONITOR, patience=3, verbose=1, mode='max', restore_best_weights=True)
callbacks = [e_stopping]
history = model.fit(
[train_input_ids, train_input_masks, train_segment_ids],
train_labels,
validation_data = ([val_input_ids, val_input_masks, val_segment_ids], val_labels),
epochs = 10,
verbose = 1,
batch_size = BATCH_SIZE, callbacks = callbacks
)
# Visualize history
# Plot history: Validation loss
import matplotlib.pyplot as plt
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.plot(history.history['loss'], label='Training Loss')
plt.title('Validation and Training loss history')
plt.ylabel('Loss value')
plt.xlabel('No. epoch')
plt.legend()
plt.show()
# Plot history: Accuracy
plt.plot(history.history['val_sparse_categorical_accuracy'], label = 'Validation Accuracy')
plt.plot(history.history['sparse_categorical_accuracy'], label = 'Training Accuracy')
plt.title('Validation and Training accuracy history')
plt.ylabel('Accuracy value (%)')
plt.xlabel('No. epoch')
plt.legend()
plt.show()
Observation:
Training loss and accuracy are better than validation loss and accuracy, but this gap is expected.
test_examples = convert_text_to_examples(test_text, np.zeros(len(test_text)))
(test_input_ids, test_input_masks, test_segment_ids, test_labels
) = convert_examples_to_features(tokenizer, test_examples, max_seq_length=max_seq_length)
# Test the model after training
test_results = model.evaluate([test_input_ids, test_input_masks, test_segment_ids], test_labels, verbose=False)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%')
prediction = model.predict([test_input_ids, test_input_masks, test_segment_ids], verbose = 1)
preds = np.argmax(prediction, axis =1)
from sklearn.metrics import accuracy_score,f1_score,classification_report
Bert_Accuracy_Score = accuracy_score(y_test, preds)
print('accuracy %s' % Bert_Accuracy_Score)
Bert_F1_Score = f1_score(y_test, preds,average='weighted')
print('Testing F1 score: {}'.format(Bert_F1_Score))
print(classification_report(y_test, preds))
Observations:
#Load the test results data and append the accuracy score of bert model
Resultsdf = pd.read_excel(project_path + 'TestResults.xlsx')
new_row = ['BERT Model', Bert_Accuracy_Score,Bert_F1_Score]
Resultsdf = Resultsdf.append(pd.Series(new_row,index=Resultsdf.columns,name=24))
display(Resultsdf)
ax = Resultsdf.plot(x='Model', y = ['Accuracy', 'F1 Score'], kind='bar', width=0.3, figsize=(15,5))
1. Handling of Skewed Data:
The given dataset is heavily skewed in terms of the number of tickets assigned to each group, so we debated different approaches.
Approach 1 – Drop the groups that have the fewest tickets assigned to them and model on the remaining data, or combine the least-populated groups into a single group so that not even a single record is lost.
Approach 2 – Include every group in the modelling without dropping or combining any of them.
Finally, to address everyone's concerns, we neither dropped groups nor used the group information as-is: we decided to combine the least-populated groups and treat them as one group.
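The chosen approach of folding the least-populated groups into one combined group can be sketched as follows. The column name, threshold, and combined label below are illustrative, not the exact ones used in the notebook:

```python
import pandas as pd

# Toy frame mimicking the ticket data; column name is an assumption
df = pd.DataFrame({'AssignmentGroup':
                   ['GRP_0'] * 10 + ['GRP_1'] * 5 + ['GRP_7'] * 2 + ['GRP_9'] * 1})

MIN_TICKETS = 3  # threshold below which a group is considered "rare"
counts = df['AssignmentGroup'].value_counts()
rare = counts[counts < MIN_TICKETS].index

# Fold all rare groups into a single combined label instead of dropping them
df['AssignmentGroup'] = df['AssignmentGroup'].where(
    ~df['AssignmentGroup'].isin(rare), other='GRP_OTHERS')
print(df['AssignmentGroup'].value_counts())
```

No rows are discarded; the rare groups simply share one target label, which reduces the number of near-empty classes the classifier must learn.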
2. Handling of Junk Values:
The dataset had to be cleansed of numerous types of junk values, while at the same time some essential values look like junk. Much of the project duration therefore went into finalising the approach and deriving the functions needed to cleanse the data while retaining the information essential for classification.
3. Handling of other languages:
The given dataset had ticket descriptions in both English and German, so we had to convert the German descriptions to English before proceeding. For this we used the Google Translate web service.
4. Tensorflow compatibility for BERT model
The BERT model we built uses TensorFlow 1.x, whereas the default version of TensorFlow in Google Colab is 2.3, and all the other deep learning models, including LSTM, were built using TensorFlow 2.x. There were therefore challenges in integrating all the models together, and we had to switch the TensorFlow version to 1.x while running the BERT model.
The given problem involved classifying ticket assignments based on the description and short description columns.
The data pre-processing involved removing junk characters, translating the text to English, removing stop words, tokenization, and lemmatization.
Then bi-gram and tri-gram models and word clouds were built to understand the mapping between the language used and the ticket groups.
Different machine learning models such as Logistic Regression, SVC, Decision Tree, Random Forest and AdaBoost Classifier with TF-IDF vectorization were built and executed.
Random Forest and Decision Tree models performed well compared to the other machine learning models.
The metrics mainly used for model comparison and evaluation were Accuracy Score and F1 Score. A classification report and confusion matrix were also generated for all the models.
Different deep learning models such as sequential NLP, simple LSTM, bidirectional LSTM, RNN and GRU were also built.
The embeddings used were Word2Vec and GloVe. Since GloVe embeddings performed better than Word2Vec, GloVe was predominantly used in most of the models.
Hyperparameter tuning was also done for parameters such as maxlen, embedding size, epochs, batch size, learning rate, etc.
The GRU model was the best of the deep learning models, followed by bidirectional LSTM with GloVe.
State-of-the-art models such as ULMFiT and BERT were also built and executed.
All these algorithms were run on Google Colab and the code was developed using the TensorFlow Keras libraries.
The test results of all these algorithms, with their accuracy and F1 scores, have been added to a separate dataframe provided in the notebook. A comparison of all the models was also done.
The accuracy and F1 scores of all the models are below 65%. This is because the given data is highly skewed towards GRP_0.
The Top 3 Performing models were:
1. ULMFit
2. GRU
3. BiDirectional LSTM with Glove
1. Data Sampling:
More data about the other assignment groups needs to be collected, and the data needs to be sampled in order for the algorithms to improve on their accuracy scores.
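Until more real data is collected, resampling can rebalance the classes. A sketch of random oversampling of the minority groups with scikit-learn, on hypothetical toy data (column names are assumptions):

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: GRP_0 dominates, as in the real ticket dump
df = pd.DataFrame({'Description': ['ticket %d' % i for i in range(12)],
                   'AssignmentGroup': ['GRP_0'] * 9 + ['GRP_1'] * 3})

majority_size = df['AssignmentGroup'].value_counts().max()
# Sample each group with replacement up to the majority group's size
balanced = pd.concat([
    resample(g, replace=True, n_samples=majority_size, random_state=10)
    for _, g in df.groupby('AssignmentGroup')
])
print(balanced['AssignmentGroup'].value_counts())
```

Oversampling should be applied to the training split only, after the train/validation/test split, so that duplicated minority rows never leak into the evaluation sets.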
2. Two Model Approach for Grp 0 and rest of the groups:
Another approach to tackle the skewness of the data is to first run a binary classifier that identifies whether a ticket belongs to GRP_0 or not. A second, multi-class classifier can then be developed to classify the remaining tickets into the rest of the assignment groups.
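The two-stage idea can be sketched as below. The mini corpus, group labels, and the choice of TF-IDF with logistic regression are all illustrative assumptions, not the notebook's actual models:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini corpus; real data would come from the ticket descriptions
texts = ['reset password', 'password locked', 'vpn outage', 'erp job failed',
         'reset account password', 'vpn tunnel down', 'erp batch error',
         'account unlock request']
groups = ['GRP_0', 'GRP_0', 'GRP_2', 'GRP_5',
          'GRP_0', 'GRP_2', 'GRP_5', 'GRP_0']

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Stage 1: binary classifier - GRP_0 vs everything else
y_binary = np.array([1 if g == 'GRP_0' else 0 for g in groups])
stage1 = LogisticRegression().fit(X, y_binary)

# Stage 2: multi-class classifier trained only on the non-GRP_0 tickets
mask = y_binary == 0
stage2 = LogisticRegression().fit(X[mask], np.array(groups)[mask])

def route(text, vectorizer, s1, s2):
    """Predict GRP_0 via stage 1, otherwise delegate to the stage-2 model."""
    v = vectorizer.transform([text])
    return 'GRP_0' if s1.predict(v)[0] == 1 else s2.predict(v)[0]

print(route('vpn tunnel down', vec, stage1, stage2))
```

Because stage 2 never sees GRP_0 rows, its training set is far less skewed, which is the point of the split.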
3. ML Pipeline to Automate the Model Building:
Further, an ML pipeline can also be developed using Apache Airflow to automate the model building process so that the models can be continuously evaluated on future data.